Automating the Acquisition of Bilingual Terminology
Abstract
As the problem of acquiring bilingual lists of terminological expressions is formidable, it is worthwhile to investigate methods to compile such lists as automatically as possible. In this paper we discuss experimental results for a number of methods that operate on corpora of previously translated texts.

Keywords: parallel corpora, tagging, terminology acquisition.

1 Introduction

In the past several years, many researchers have started looking at bilingual corpora, as they implicitly contain much information needed for various purposes that would otherwise have to be compiled manually. Some applications using information extracted from bilingual corpora are statistical MT ([Brown et al., 1990]), bilingual lexicography ([Catizone et al., 1989]), word sense disambiguation ([Gale et al., 1992]), and multilingual information retrieval ([Landauer and Littmann, 1990]).

The goal of the research discussed in this paper is to automate as much as possible the generation of bilingual term lists from previously translated texts. These lists are used by terminologists and translators, e.g. in documentation departments. Manual compilation of bilingual term lists is expensive and laborious, hence the relative rarity of specialized, up-to-date, and manageable terminological data collections. However, organizations interested in terminology and translation are likely to have archives of previously translated documents, which represent a considerable investment. Automatic or semi-automatic extraction of the information contained in these documents is therefore an attractive prospect.

A bilingual term list is a list associating each source language term with a ranked list of target language terms. The methods to extract bilingual terminology from parallel texts were developed and evaluated experimentally using a bilingual Dutch-English corpus. There are two phases in the process:

1. Process the texts to extract terms. The definition of the notion 'term' will be an important issue in this paper, as it is necessary to adopt a definition that facilitates comparison of terms in the source and target language. Section 4 will show some flaws of methods that define terms as words or nouns. Terminologists commonly use full noun phrases (see footnote 1) as terms to express (domain-specific) concepts. Sections 5.1 and 5.2 show that the NP level is a better level at which to compare Dutch and English. This phase acts as a linguistic front end to the second phase. The various techniques used to process the corpus are described in section 2.

2. Apply statistical techniques to determine correspondences between source and target language. In section 3 we will introduce a simple algorithm to select and order potential translations for a given term. This method will subsequently be compared to two other methods discussed in the literature.

The usual benefits of modularity apply because the two phases are highly independent.

Footnote 1: To some extent, a particular domain will also have textual elements specific to the domain that are not NPs. We will ignore these, but essentially the same methods could be used to create bilingual lists of e.g. verbs.
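The two-phase organization and the shape of a bilingual term list can be illustrated with a minimal Python sketch. It is not the algorithm of section 3, which is not given in this excerpt. Everything here is an assumption made for illustration: the function names `extract_terms` and `build_term_list` are hypothetical, phase 1 is replaced by a whitespace tokenizer standing in for a real NP extractor, and phase 2 ranks candidates by raw co-occurrence counts over aligned segment pairs.

```python
from collections import Counter, defaultdict


def extract_terms(segment):
    """Phase 1 stand-in (hypothetical): return the set of terms in a segment.

    A real front end would tag the text and extract noun phrases; lowercased
    whitespace tokens are used here only to keep the sketch self-contained.
    """
    return set(segment.lower().split())


def build_term_list(aligned_pairs):
    """Phase 2 stand-in (hypothetical): rank target terms for each source term.

    Candidates are ordered by how often they co-occur with the source term in
    aligned segment pairs (an assumed scoring, not the paper's), with ties
    broken alphabetically so the output is deterministic.
    """
    cooccurrence = defaultdict(Counter)
    for source_segment, target_segment in aligned_pairs:
        target_terms = extract_terms(target_segment)
        for source_term in extract_terms(source_segment):
            cooccurrence[source_term].update(target_terms)

    # The bilingual term list: source term -> ranked list of target terms.
    return {
        source_term: [
            term
            for term, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
        ]
        for source_term, counts in cooccurrence.items()
    }


if __name__ == "__main__":
    # Toy Dutch-English aligned segments, invented for this example.
    pairs = [("de hond", "the dog"), ("de kat", "the cat"), ("hond", "dog")]
    for term, candidates in sorted(build_term_list(pairs).items()):
        print(term, "->", candidates)
```

Because the two functions only communicate through sets of terms, the tokenizer could be swapped for an NP extractor without touching the ranking step, which is the modularity benefit mentioned above.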
Similar resources
Learning bilingual translations from comparable corpora to cross-language information retrieval: hybrid statistics-based and linguistics-based approach
Recent years saw an increased interest in the use and the construction of large corpora. With this increased interest and awareness has come an expansion in the application to knowledge acquisition and bilingual terminology extraction. The present paper will seek to present an approach to bilingual lexicon extraction from non-aligned comparable corpora, combination to linguisticsbased pruning a...
Bilingual Terminology Acquisition from Comparable Corpora and Phrasal Translation to Cross-Language Information Retrieval
The present paper will seek to present an approach to bilingual lexicon extraction from non-aligned comparable corpora, phrasal translation as well as evaluations on Cross-Language Information Retrieval. A two-stages translation model is proposed for the acquisition of bilingual terminology from comparable corpora, disambiguation and selection of best translation alternatives according to their...
Primary Data Encoding of a Bilingual Corpus
This paper discusses the building of a bilingual corpus of legal and administrative texts, focusing on the encoding of documentation and structural information according to the Corpus Encoding Standard. The corpus is one module in an ongoing research project about (semi-)automatic terminology acquisition at the European Academy Bolzano and will serve as a basis for applying term extraction prog...
The relationship between second language acquisition and mathematics accomplishment among second graders
Introduction: Study of bilingualism will enhance the understanding of the cognitive and neural mechanisms responsible for learning. Cognitive correlates of bilingualism such as enhancement of attention control, problem solving and working memory would be worth studying especially among young children to improve their future performances. Among the wide range of advantages of bilingualism, worki...